1 Problem Statement:

Build the right model to predict the “Energy Star score” of a building using the provided data.

1.1 Execuitive Summary

The NYC Benchmarking Law requires owners of large buildings to annually measure their energy and water consumption in a process called benchmarking. The law standardizes this process by requiring building owners to enter their annual energy and water use in the U.S. Environmental Protection Agency’s (EPA) online tool, Energy Star Portfolio Manager and use the tool to submit data. This data informs building owners about a building’s energy and water consumption compared to similar buildings, and tracks progress year over year to help in energy efficiency planning

An energy efficiency score is the Energy Star Rating that a building earns using the United States Environmental Protection Agency online benchmarking tool, Energy Star Portfolio Manager, to compare building energy performance to similar buildings in similar climates.

2 My thought process

After going through the entire assignment I choose to work upon this problem as I am confident to complete this within the given timeframe of 2 days and also this problem will allow me to demonstrate my capabilities in a better way, which probably will also be advantageous for the team to decide upon my candidature.

My first objective is to get a good understanding of the dataset followed by exploratory data analysis, data sanity check, feature engineering, feature selection, model selection, stacking, scoring and any future work if required.

I will be highlighting all important notes and points wherever required, just like the way this line is highlighted

Since the target variable is bounded between 1 and 100 and total unique categories/levels for this variable is less than 100 hence I think tree based models like xgboost, gbm, RF can work better here because they partition the feature space and then aggregate the information at partitioned space level.

Lets deep dive and see whats there in the Data.

3 Loading and Exploring Data

3.1 Loading required libraries

Loading R packages used besides base R.

library(knitr)
library(ggplot2)
library(plyr)
library(dplyr)
library(caret)
library(gridExtra)
library(scales)
library(randomForest)
library(psych)
library(xgboost)
library(readr)
library(data.table)
library(DT)
library(DataExplorer)

3.2 Loading the dataset

setwd("C:/Users/sysadmin/Downloads/Assignment/")
raw_data<-read_csv("Usecase1_Dataset.csv")
DT::datatable(head(raw_data,100))

3.3 Data definition and data assumptions

Here I am trying to put my understanding of the data and what each column mean. I might be wrong but for the time being I would continue with my assumptions about the data. If my candidature goes forward we can discuss on these or any of the assumptions I would be taking in this entire notebook.

3.4 Data Exploration

The information presented below is just a first impression of the data, we will keep on changing variable type based the kind of information it carries.

–> Also the very first glance at the data set I could see missing values being represented as “Not Available”, we will change it to missing or ‘0’ based on the information present in that variable

datatable(introduce(raw_data))
plot_intro(raw_data)

lets replace the ‘Not Available’ as missing information in our data set.

raw_data[raw_data=='Not Available']=NA

Now lets visualize the missing information in the data

# Custom function
plot_missing_2 <-
function (data, group = list(Good = 0.05, Okay = 0.4, Poor = 0.8, 
  Scarce =  1), geom_label_args = list(), title = NULL, ggtheme = theme_gray(), 
theme_config = list(legend.position = c("bottom"))) 
{
  pct_missing <- Band <- NULL
  missing_value <- data.table(profile_missing(data))
  group <- group[sort.list(unlist(group))]
  invisible(lapply(seq_along(group), function(i) {
    if (i == 1) {
      missing_value[pct_missing <= group[[i]], `:=`(Band,
         names(group)[i])]
    } else {
  missing_value[pct_missing > group[[i - 1]] & pct_missing <= 
     group[[i]], `:=`(Band, names(group)[i])]
    }
}))
  output <- ggplot(missing_value, aes_string(x = "feature", 
    y = "num_missing", fill = "Band")) + geom_bar(stat = "identity") + 
   scale_fill_manual("Band", values = c("Good"="green2","Okay"="gold","Poor"="darkorange","Scarce"="firebrick2")) + coord_flip() + xlab("Features") + 
   ylab("Missing Rows")
  geom_label_args_list <- list(mapping = aes(label = paste0(round(100 * 
    pct_missing, 2), "%")))
  output <- output + do.call("geom_label", c(geom_label_args_list, 
     geom_label_args))
  class(output) <- c("single", class(output))
  plotDataExplorer(plot_obj = output, title = title, ggtheme = ggtheme, 
   theme_config = theme_config)
}

plot_missing_2(raw_data)

There seems to be a lot of missing values we will handle them carefully in our next sections.

4 Data Transformation

Here in this section we will be changing the data type of the variables which might have been read as a character/factor variable because of the presence of any unwanted string values but we will also have to be cautious while transforming some of the numeric columns to factors(It might result in a lot of factor levels)–>I mean binning.

datatable(str(raw_data))
## tibble [11,746 x 60] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Order                                                     : num [1:11746] 1 2 3 4 5 6 7 10 11 12 ...
##  $ Property Id                                               : num [1:11746] 13286 28400 4778226 4778267 4778288 ...
##  $ Property Name                                             : chr [1:11746] "201/205" "NYP Columbia (West Campus)" "MSCHoNY North" "Herbert Irving Pavilion & Millstein Hospital" ...
##  $ Parent Property Id                                        : chr [1:11746] "13286" "28400" "28400" "28400" ...
##  $ Parent Property Name                                      : chr [1:11746] "201/205" "NYP Columbia (West Campus)" "NYP Columbia (West Campus)" "NYP Columbia (West Campus)" ...
##  $ BBL - 10 digits                                           : chr [1:11746] "1013160001" "1021380040" "1021380030" "1021390001" ...
##  $ NYC Borough, Block and Lot (BBL) self-reported            : chr [1:11746] "1013160001" "1-02138-0040" "1-02138-0030" "1-02139-0001" ...
##  $ NYC Building Identification Number (BIN)                  : chr [1:11746] "1037549" "1084198; 1084387;1084385; 1084386; 1084388; 1084389; 1807867; 1809824" "1063380" "1087281; 1076746" ...
##  $ Address 1 (self-reported)                                 : chr [1:11746] "201/205  East  42nd  st." "622 168th Street" "3975 Broadway" "161 Fort Washington Ave" ...
##  $ Address 2                                                 : chr [1:11746] NA NA NA "177 Fort Washington Ave" ...
##  $ Postal Code                                               : chr [1:11746] "10017" "10032" "10032" "10032" ...
##  $ Street Number                                             : chr [1:11746] "675" "180" "3975" "161" ...
##  $ Street Name                                               : chr [1:11746] "3 AVENUE" "FT WASHINGTON AVENUE" "BROADWAY" "FT WASHINGTON AVENUE" ...
##  $ Borough                                                   : chr [1:11746] "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
##  $ DOF Gross Floor Area                                      : num [1:11746] 289356 3693539 152765 891040 211400 ...
##  $ Primary Property Type - Self Selected                     : chr [1:11746] "Office" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" ...
##  $ List of All Property Use Types at Property                : chr [1:11746] "Office" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" ...
##  $ Largest Property Use Type                                 : chr [1:11746] "Office" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" ...
##  $ Largest Property Use Type - Gross Floor Area (ft²)        : chr [1:11746] "293447" "3889181" "231342" "1305748" ...
##  $ 2nd Largest Property Use Type                             : chr [1:11746] NA NA NA NA ...
##  $ 2nd Largest Property Use - Gross Floor Area (ft²)         : chr [1:11746] NA NA NA NA ...
##  $ 3rd Largest Property Use Type                             : chr [1:11746] NA NA NA NA ...
##  $ 3rd Largest Property Use Type - Gross Floor Area (ft²)    : chr [1:11746] NA NA NA NA ...
##  $ Year Built                                                : num [1:11746] 1963 1969 1924 1971 1932 ...
##  $ Number of Buildings - Self-reported                       : num [1:11746] 2 12 1 1 1 12 1 4 1 1 ...
##  $ Occupancy                                                 : num [1:11746] 100 100 100 100 100 100 60 100 100 100 ...
##  $ Metered Areas (Energy)                                    : chr [1:11746] "Whole Building" "Whole Building" NA NA ...
##  $ Metered Areas  (Water)                                    : chr [1:11746] NA "Whole Building" NA NA ...
##  $ ENERGY STAR Score                                         : chr [1:11746] NA "55" NA NA ...
##  $ Site EUI (kBtu/ft²)                                       : chr [1:11746] "305.6" "229.8" NA NA ...
##  $ Weather Normalized Site EUI (kBtu/ft²)                    : chr [1:11746] "303.1" "228.8" NA NA ...
##  $ Weather Normalized Site Electricity Intensity (kWh/ft²)   : chr [1:11746] "37.8" "24.8" NA NA ...
##  $ Weather Normalized Site Natural Gas Intensity (therms/ft²): chr [1:11746] NA "2.4" NA NA ...
##  $ Weather Normalized Source EUI (kBtu/ft²)                  : chr [1:11746] "614.2" "401.1" NA NA ...
##  $ Fuel Oil #1 Use (kBtu)                                    : chr [1:11746] NA NA NA NA ...
##  $ Fuel Oil #2 Use (kBtu)                                    : chr [1:11746] NA "1.96248472E7" NA NA ...
##  $ Fuel Oil #4 Use (kBtu)                                    : chr [1:11746] NA NA NA NA ...
##  $ Fuel Oil #5 & 6 Use (kBtu)                                : chr [1:11746] NA NA NA NA ...
##  $ Diesel #2 Use (kBtu)                                      : chr [1:11746] NA NA NA NA ...
##  $ District Steam Use (kBtu)                                 : chr [1:11746] "5.15506751E7" "-3.914148026E8" NA NA ...
##  $ Natural Gas Use (kBtu)                                    : chr [1:11746] NA "933073441" NA NA ...
##  $ Weather Normalized Site Natural Gas Use (therms)          : chr [1:11746] NA "9330734.4" NA NA ...
##  $ Electricity Use - Grid Purchase (kBtu)                    : chr [1:11746] "38139374.2" "332365924" NA NA ...
##  $ Weather Normalized Site Electricity (kWh)                 : chr [1:11746] "1.10827705E7" "9.62613121E7" NA NA ...
##  $ Total GHG Emissions (Metric Tons CO2e)                    : chr [1:11746] "6962.2" "55870.4" "0" "0" ...
##  $ Direct GHG Emissions (Metric Tons CO2e)                   : chr [1:11746] "0" "51016.4" "0" "0" ...
##  $ Indirect GHG Emissions (Metric Tons CO2e)                 : chr [1:11746] "6962.2" "4854.1" "0" "0" ...
##  $ Property GFA - Self-Reported (ft²)                        : num [1:11746] 762051 3889181 231342 1305748 179694 ...
##  $ Water Use (All Water Sources) (kgal)                      : chr [1:11746] NA NA NA NA ...
##  $ Water Intensity (All Water Sources) (gal/ft²)             : chr [1:11746] NA NA NA NA ...
##  $ Source EUI (kBtu/ft²)                                     : chr [1:11746] "619.4" "404.3" NA NA ...
##  $ Release Date                                              : chr [1:11746] "05/01/2017 05:32:03 PM" "04/27/2017 11:23:27 AM" "04/27/2017 11:23:27 AM" "04/27/2017 11:23:27 AM" ...
##  $ Water Required?                                           : chr [1:11746] "No" "No" "No" "No" ...
##  $ DOF Benchmarking Submission Status                        : chr [1:11746] "In Compliance" "In Compliance" "In Compliance" "In Compliance" ...
##  $ Latitude                                                  : num [1:11746] 40.8 40.8 40.8 40.8 40.8 ...
##  $ Longitude                                                 : num [1:11746] -74 -73.9 -73.9 -73.9 -73.9 ...
##  $ Community Board                                           : num [1:11746] 6 12 12 12 12 8 8 13 13 13 ...
##  $ Council District                                          : num [1:11746] 4 10 10 10 10 5 5 23 23 23 ...
##  $ Census Tract                                              : num [1:11746] 88 251 251 255 255 ...
##  $ NTA                                                       : chr [1:11746] "Turtle Bay-East Midtown" "Washington Heights South" "Washington Heights South" "Washington Heights South" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Order = col_double(),
##   ..   `Property Id` = col_double(),
##   ..   `Property Name` = col_character(),
##   ..   `Parent Property Id` = col_character(),
##   ..   `Parent Property Name` = col_character(),
##   ..   `BBL - 10 digits` = col_character(),
##   ..   `NYC Borough, Block and Lot (BBL) self-reported` = col_character(),
##   ..   `NYC Building Identification Number (BIN)` = col_character(),
##   ..   `Address 1 (self-reported)` = col_character(),
##   ..   `Address 2` = col_character(),
##   ..   `Postal Code` = col_character(),
##   ..   `Street Number` = col_character(),
##   ..   `Street Name` = col_character(),
##   ..   Borough = col_character(),
##   ..   `DOF Gross Floor Area` = col_double(),
##   ..   `Primary Property Type - Self Selected` = col_character(),
##   ..   `List of All Property Use Types at Property` = col_character(),
##   ..   `Largest Property Use Type` = col_character(),
##   ..   `Largest Property Use Type - Gross Floor Area (ft²)` = col_character(),
##   ..   `2nd Largest Property Use Type` = col_character(),
##   ..   `2nd Largest Property Use - Gross Floor Area (ft²)` = col_character(),
##   ..   `3rd Largest Property Use Type` = col_character(),
##   ..   `3rd Largest Property Use Type - Gross Floor Area (ft²)` = col_character(),
##   ..   `Year Built` = col_double(),
##   ..   `Number of Buildings - Self-reported` = col_double(),
##   ..   Occupancy = col_double(),
##   ..   `Metered Areas (Energy)` = col_character(),
##   ..   `Metered Areas  (Water)` = col_character(),
##   ..   `ENERGY STAR Score` = col_character(),
##   ..   `Site EUI (kBtu/ft²)` = col_character(),
##   ..   `Weather Normalized Site EUI (kBtu/ft²)` = col_character(),
##   ..   `Weather Normalized Site Electricity Intensity (kWh/ft²)` = col_character(),
##   ..   `Weather Normalized Site Natural Gas Intensity (therms/ft²)` = col_character(),
##   ..   `Weather Normalized Source EUI (kBtu/ft²)` = col_character(),
##   ..   `Fuel Oil #1 Use (kBtu)` = col_character(),
##   ..   `Fuel Oil #2 Use (kBtu)` = col_character(),
##   ..   `Fuel Oil #4 Use (kBtu)` = col_character(),
##   ..   `Fuel Oil #5 & 6 Use (kBtu)` = col_character(),
##   ..   `Diesel #2 Use (kBtu)` = col_character(),
##   ..   `District Steam Use (kBtu)` = col_character(),
##   ..   `Natural Gas Use (kBtu)` = col_character(),
##   ..   `Weather Normalized Site Natural Gas Use (therms)` = col_character(),
##   ..   `Electricity Use - Grid Purchase (kBtu)` = col_character(),
##   ..   `Weather Normalized Site Electricity (kWh)` = col_character(),
##   ..   `Total GHG Emissions (Metric Tons CO2e)` = col_character(),
##   ..   `Direct GHG Emissions (Metric Tons CO2e)` = col_character(),
##   ..   `Indirect GHG Emissions (Metric Tons CO2e)` = col_character(),
##   ..   `Property GFA - Self-Reported (ft²)` = col_double(),
##   ..   `Water Use (All Water Sources) (kgal)` = col_character(),
##   ..   `Water Intensity (All Water Sources) (gal/ft²)` = col_character(),
##   ..   `Source EUI (kBtu/ft²)` = col_character(),
##   ..   `Release Date` = col_character(),
##   ..   `Water Required?` = col_character(),
##   ..   `DOF Benchmarking Submission Status` = col_character(),
##   ..   Latitude = col_double(),
##   ..   Longitude = col_double(),
##   ..   `Community Board` = col_double(),
##   ..   `Council District` = col_double(),
##   ..   `Census Tract` = col_double(),
##   ..   NTA = col_character()
##   .. )
raw_data<-as.data.frame(raw_data)
cols <- colnames(raw_data[,c(19,21,23,29:51)])
raw_data[cols] <- sapply(raw_data[cols],as.numeric)
raw_data$`ENERGY STAR Score`<-as.numeric(raw_data$`ENERGY STAR Score`)

5 Insights gathered so far

  1. Lets look at the ‘NYC Borough, Block and Lot (BBL) self-reported’, ‘BBL - 10 digits’ and ‘Borough’ variables.
datatable(raw_data[,c('NYC Borough, Block and Lot (BBL) self-reported','BBL - 10 digits','Borough')])

It seems that all these columns contains almost the same information. Probably the first digit in ‘NYC Borough, Block and Lot (BBL) self-reported’ contains the information about ‘Borough’ next five digits ‘Tax Block’ and last four ‘Tax Lot’. Since there are less missing values in ‘Borough’.

  1. ‘Address 1 (self-reported)’, ‘Address 2’, ‘Postal Code’,‘Street Name’, ‘Street Number’,‘Latitude’ and ‘Longitude’ contain information about the localization.I’ll keep only ‘Postal Code’ as it seems to be the most generic information about it, besides ‘Borough’. There may be additional or more general information about the location at ‘Census Tract’, ‘Community Board’, ‘Council District’, but for now I’ll ignore these COLUMNS.

  2. ‘Total GHG Emissions (Metric Tons CO2e)’=‘Direct GHG Emissions (Metric Tons CO2e)’+‘Indirect GHG Emissions (Metric Tons CO2e)’ — I’ll keep the total and the fraction of direct

  3. ‘Property GFA - Self-Reported (ft²)’=‘DOF Gross Floor Area’, so I’ll drop the ‘DOF Gross Floor Area’ that has missing values

  4. ‘Largest Property Use Type - Gross Floor Area (ft²)’,‘2nd Largest Property Use - Gross Floor Area (ft²)’ and ‘3rd Largest Property Use Type - Gross Floor Area (ft²)’ are part of ‘Property GFA - Self-Reported (ft²)’– I’ll create and keep the fractions they represent and drop the original values

6 Few Feature enginnered columns

6.1 Feature Engineering

raw_data<-as.data.frame(raw_data)
raw_data$largest_property_use_rate<-raw_data$`Largest Property Use Type - Gross Floor Area (ft²)`/raw_data$`Property GFA - Self-Reported (ft²)`

raw_data$second_property_use_rate <- raw_data$`2nd Largest Property Use - Gross Floor Area (ft²)`/raw_data$`Property GFA - Self-Reported (ft²)`

raw_data$third_property_use_rate <- raw_data$`3rd Largest Property Use Type - Gross Floor Area (ft²)`/raw_data$`Property GFA - Self-Reported (ft²)`

raw_data$Direct_GHG_Emissions_rate<-raw_data$`Direct GHG Emissions (Metric Tons CO2e)`/raw_data$`Total GHG Emissions (Metric Tons CO2e)`

raw_data[raw_data==""]=NA
raw_data[raw_data==" "]=NA

7 Missing value treatment

After the transformation and removal of blank spaces from the data, lets see if the missing information has changed from what we observed earlier:

plot_missing_2(raw_data)

There is a difference meaning there were blank spaces which has resulted in additional missing information in the data.

Lets treat the missing variables which are important to us:

  1. Replacing missing information in ‘Borough’ with ‘BBL - 10 digits’ and creating new features namely: “Lot” and “Block”
raw_data$Block<-substr(raw_data$`BBL - 10 digits`,2,6)
raw_data$lot<-substr(raw_data$`BBL - 10 digits`,7,10)
  1. We still have a lot of missing values. Here I will convert some missing values into zero because they are actually zero, transform missing values at 2nd and 3rd Property Use Type to ‘none’ and then drop some of the columns that I believe are irrelevant for the analysis.
library(tidyr)

zero_col = c('second_property_use_rate','third_property_use_rate','Water Intensity (All Water Sources) (gal/ft²)','Weather Normalized Site Natural Gas Intensity (therms/ft²)','Total GHG Emissions (Metric Tons CO2e)','Weather Normalized Site Electricity Intensity (kWh/ft²)','Direct_GHG_Emissions_rate','Total GHG Emissions (Metric Tons CO2e)')

plot_missing_2(raw_data[zero_col])

raw_data<-raw_data %>%
  mutate_at(vars(zero_col), ~replace_na(., 0))

prop_col = c('2nd Largest Property Use Type','3rd Largest Property Use Type')

raw_data<-raw_data %>%
  mutate_at(vars(prop_col), ~replace_na(., 'none'))
columns_to_drop = c('NYC Borough, Block and Lot (BBL) self-reported',
            'NYC Building Identification Number (BIN)',
            'BBL - 10 digits',
            'Parent Property Name',
            'Property Name',
            'Address 1 (self-reported)',
            'Address 2',
            'Street Number',
            'Street Name',
            'Latitude',
            'Longitude',
            'DOF Gross Floor Area',
            'DOF Benchmarking Submission Status',
            'List of All Property Use Types at Property',
            'Largest Property Use Type - Gross Floor Area (ft²)',
            '2nd Largest Property Use - Gross Floor Area (ft²)',
            '3rd Largest Property Use Type - Gross Floor Area (ft²)',                     
            'Fuel Oil #1 Use (kBtu)',                                         
            'Fuel Oil #2 Use (kBtu)',                                         
            'Fuel Oil #4 Use (kBtu)',                                         
            'Fuel Oil #5 & 6 Use (kBtu)',                                     
            'Diesel #2 Use (kBtu)',                                           
            'District Steam Use (kBtu)',                                      
            'Natural Gas Use (kBtu)',                                         
            'Weather Normalized Site Natural Gas Use (therms)',               
            'Electricity Use - Grid Purchase (kBtu)',                         
            'Weather Normalized Site Electricity (kWh)',                      
            'Direct GHG Emissions (Metric Tons CO2e)',                        
            'Indirect GHG Emissions (Metric Tons CO2e)',                      
            'Property GFA - Self-Reported (ft²)',                              
            'Water Use (All Water Sources) (kgal)', 
            'Weather Normalized Site EUI (kBtu/ft²)',
            'Weather Normalized Source EUI (kBtu/ft²)')

raw_data1<-raw_data[!colnames(raw_data)%in%columns_to_drop]

8 Binning Property type variables

All property type columns have a lot of levels we need to bin some of the levels which are not relevent/redundant/least frequent. There are many ways of binning levels of a factor variable but for now I would restrict myself to frequency based renaming and binning.

condenseMe <- function(vector, threshold = 0.002, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}



raw_data1$`Largest Property Use Type`<-condenseMe(raw_data1$`Largest Property Use Type`)
raw_data1$`2nd Largest Property Use Type`<-condenseMe(raw_data1$`2nd Largest Property Use Type`)
raw_data1$`3rd Largest Property Use Type`<-condenseMe(raw_data1$`3rd Largest Property Use Type`)

Lets check the unique values in these three columns. Now after frequency based binning the levels of these factors have reduced to 17,16,10 respectively

sapply(raw_data1, uniqueN)
##                                                      Order 
##                                                      11746 
##                                                Property Id 
##                                                      11746 
##                                         Parent Property Id 
##                                                        102 
##                                                Postal Code 
##                                                        286 
##                                                    Borough 
##                                                          6 
##                      Primary Property Type - Self Selected 
##                                                         55 
##                                  Largest Property Use Type 
##                                                         18 
##                              2nd Largest Property Use Type 
##                                                         16 
##                              3rd Largest Property Use Type 
##                                                         10 
##                                                 Year Built 
##                                                        157 
##                        Number of Buildings - Self-reported 
##                                                         49 
##                                                  Occupancy 
##                                                         19 
##                                     Metered Areas (Energy) 
##                                                          8 
##                                     Metered Areas  (Water) 
##                                                          7 
##                                          ENERGY STAR Score 
##                                                        101 
##                                        Site EUI (kBtu/ft²) 
##                                                       1959 
##    Weather Normalized Site Electricity Intensity (kWh/ft²) 
##                                                        440 
## Weather Normalized Site Natural Gas Intensity (therms/ft²) 
##                                                         65 
##                     Total GHG Emissions (Metric Tons CO2e) 
##                                                       7817 
##              Water Intensity (All Water Sources) (gal/ft²) 
##                                                       5606 
##                                      Source EUI (kBtu/ft²) 
##                                                       2920 
##                                               Release Date 
##                                                       3537 
##                                            Water Required? 
##                                                          3 
##                                            Community Board 
##                                                         20 
##                                           Council District 
##                                                         45 
##                                               Census Tract 
##                                                        808 
##                                                        NTA 
##                                                        145 
##                                  largest_property_use_rate 
##                                                       3132 
##                                   second_property_use_rate 
##                                                       3434 
##                                    third_property_use_rate 
##                                                       1400 
##                                  Direct_GHG_Emissions_rate 
##                                                      10649 
##                                                      Block 
##                                                       4068 
##                                                        lot 
##                                                        483

Lets look at the individual distribution of target with respect to these frequent levels of property use type variable.

  1. Largest Property use type: This does gives an indication that some of the levels are actually important in predicting the target variable like parking.
dat<-raw_data1[,c(7:9,15)]
colnames(dat)<-make.names(colnames(dat))

library(plotly)
fig <- plot_ly(dat, y = ~ENERGY.STAR.Score, color = ~Largest.Property.Use.Type, type = "box")

fig
  1. 2nd Largest property use type
library(plotly)
fig <- plot_ly(dat, y = ~ENERGY.STAR.Score, color = ~X2nd.Largest.Property.Use.Type, type = "box")

fig
  1. 3rd Largest property use type
library(plotly)
fig <- plot_ly(dat, y = ~ENERGY.STAR.Score, color = ~X3rd.Largest.Property.Use.Type, type = "box")

fig

9 Missing values imputation and removing some other redundant columns

fIRST lets see the proportion of missing values in all the columns

prop_NA <- apply(raw_data1, 2, function(x) round(sum(is.na(x))/length(x), digits = 4))
# ordenando os resultados
prop_NA[order(prop_NA)]
##                                                      Order 
##                                                     0.0000 
##                                                Property Id 
##                                                     0.0000 
##                                         Parent Property Id 
##                                                     0.0000 
##                                                Postal Code 
##                                                     0.0000 
##                      Primary Property Type - Self Selected 
##                                                     0.0000 
##                              2nd Largest Property Use Type 
##                                                     0.0000 
##                              3rd Largest Property Use Type 
##                                                     0.0000 
##                                                 Year Built 
##                                                     0.0000 
##                        Number of Buildings - Self-reported 
##                                                     0.0000 
##                                                  Occupancy 
##                                                     0.0000 
##    Weather Normalized Site Electricity Intensity (kWh/ft²) 
##                                                     0.0000 
## Weather Normalized Site Natural Gas Intensity (therms/ft²) 
##                                                     0.0000 
##                     Total GHG Emissions (Metric Tons CO2e) 
##                                                     0.0000 
##              Water Intensity (All Water Sources) (gal/ft²) 
##                                                     0.0000 
##                                               Release Date 
##                                                     0.0000 
##                                   second_property_use_rate 
##                                                     0.0000 
##                                    third_property_use_rate 
##                                                     0.0000 
##                                  Direct_GHG_Emissions_rate 
##                                                     0.0000 
##                                  Largest Property Use Type 
##                                                     0.0002 
##                                  largest_property_use_rate 
##                                                     0.0002 
##                                                      Block 
##                                                     0.0009 
##                                                        lot 
##                                                     0.0009 
##                                     Metered Areas (Energy) 
##                                                     0.0049 
##                                                    Borough 
##                                                     0.0100 
##                                            Water Required? 
##                                                     0.0100 
##                                        Site EUI (kBtu/ft²) 
##                                                     0.0139 
##                                      Source EUI (kBtu/ft²) 
##                                                     0.0139 
##                                          ENERGY STAR Score 
##                                                     0.1791 
##                                            Community Board 
##                                                     0.1927 
##                                           Council District 
##                                                     0.1927 
##                                               Census Tract 
##                                                     0.1927 
##                                                        NTA 
##                                                     0.1927 
##                                     Metered Areas  (Water) 
##                                                     0.3924

Lets visualize these missing values to have an idea about where in the entire data are they missing

library(naniar)
gg_miss_upset(raw_data1)

vis_miss(raw_data1)

We can see proper intersection of rows. Most of the missing variables are having common missing rows.So we can tryout removing rows where the target variables is missing and see if it reduces the amount of missingness in other columns.

Removing observations where target variable is missing and also removing some more redundant variables like Metered Areas (Energy) & Metered Areas (Water) which has almost contant variance. Also removing water required? variable as there is no difference in the target variables distribution for yes or no values of water required?

raw_data1<-raw_data1[!is.na(raw_data1$`ENERGY STAR Score`),]

tapply(raw_data1$`ENERGY STAR Score`,raw_data1$`Water Required?`, summary)
## $No
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   37.00   65.00   59.82   85.00  100.00 
## 
## $Yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    36.0    65.0    59.8    85.0   100.0
remv<-c('Metered Areas (Energy)','Metered Areas  (Water)','Release Date','Water Required?','Community Board','Council District','Census Tract','NTA')

raw_data1<-raw_data1[!colnames(raw_data1)%in% remv]
prop_NA <- apply(raw_data1, 2, function(x) round(sum(is.na(x))/length(x), digits = 4))
# ordenando os resultados
prop_NA[order(prop_NA)]
##                                                      Order 
##                                                     0.0000 
##                                                Property Id 
##                                                     0.0000 
##                                         Parent Property Id 
##                                                     0.0000 
##                                                Postal Code 
##                                                     0.0000 
##                      Primary Property Type - Self Selected 
##                                                     0.0000 
##                                  Largest Property Use Type 
##                                                     0.0000 
##                              2nd Largest Property Use Type 
##                                                     0.0000 
##                              3rd Largest Property Use Type 
##                                                     0.0000 
##                                                 Year Built 
##                                                     0.0000 
##                        Number of Buildings - Self-reported 
##                                                     0.0000 
##                                                  Occupancy 
##                                                     0.0000 
##                                          ENERGY STAR Score 
##                                                     0.0000 
##                                        Site EUI (kBtu/ft²) 
##                                                     0.0000 
##    Weather Normalized Site Electricity Intensity (kWh/ft²) 
##                                                     0.0000 
## Weather Normalized Site Natural Gas Intensity (therms/ft²) 
##                                                     0.0000 
##                     Total GHG Emissions (Metric Tons CO2e) 
##                                                     0.0000 
##              Water Intensity (All Water Sources) (gal/ft²) 
##                                                     0.0000 
##                                      Source EUI (kBtu/ft²) 
##                                                     0.0000 
##                                  largest_property_use_rate 
##                                                     0.0000 
##                                   second_property_use_rate 
##                                                     0.0000 
##                                    third_property_use_rate 
##                                                     0.0000 
##                                  Direct_GHG_Emissions_rate 
##                                                     0.0000 
##                                                      Block 
##                                                     0.0002 
##                                                        lot 
##                                                     0.0002 
##                                                    Borough 
##                                                     0.0065

We can impute the missing variables using mode/model based imputation but since the amount of missingness is low I would rather remove them as I don’t have a lot of time left ;)

raw_data1<-raw_data1[complete.cases(raw_data1), ]

Removing columns which won’t be used in the modelling

selection = c('Postal Code','Parent Property Id','Property Id','lot','Block',
        '3rd Largest Property Use Type','2nd Largest Property Use Type',
        'Primary Property Type - Self Selected','Year Built')

raw_data1<-raw_data1[!colnames(raw_data1)%in% selection]

10 One hot encoding

dmy <- dummyVars(" ~ .", data = raw_data1)
train_data <- data.frame(predict(dmy, newdata = raw_data1))

11 Feature selection with Cross validation (RFE)

In this section I will be using Recursive feature elimination using RANDOM FOREST as a technique to gaze into the features which might be important for modelling

control <- rfeControl(functions=rfFuncs, method="cv", number=10,verbose =T)
# run the RFE algorithm
results <- readRDS("results.RDS")
results
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables  RMSE Rsquared    MAE RMSESD RsquaredSD  MAESD Selected
##          2 18.25   0.6833 14.824 1.5612    0.07318 1.3243         
##          4 19.97   0.7021 16.568 0.9259    0.01575 0.8294         
##          8 14.05   0.7941 10.729 0.5411    0.01671 0.5487         
##         10 13.20   0.8075  9.541 0.5367    0.01587 0.2136         
##         15 12.80   0.8167  8.989 0.3975    0.01153 0.1742         
##         31 12.60   0.8226  8.854 0.4194    0.01195 0.1924        *
## 
## The top 5 variables (out of 31):
##    X.Source.EUI..kBtu.ft².., X.Largest.Property.Use.Type.Office, X.Largest.Property.Use.Type.Non.Refrigerated.Warehouse, Direct_GHG_Emissions_rate, X.Largest.Property.Use.Type.Multifamily.Housing

It seems most of the variables we have considered are important for modelling(This is recursive feature elimination method using RF). An Rsq. value of 0.82 is good to start with and also with an MAE of 8.854 it looks like random forest could be a good method to use

It is also visible that using only 15 features out of these 31 features we might also get ~0.89 as MAE. We might want to go with only 15 features as more features would bring in the fear of overfitting

–>Lets look at the order in which the importance has been given by RFE:

results$optVariables
##  [1] "X.Source.EUI..kBtu.ft².."                                         
##  [2] "X.Largest.Property.Use.Type.Office"                               
##  [3] "X.Largest.Property.Use.Type.Non.Refrigerated.Warehouse"           
##  [4] "Direct_GHG_Emissions_rate"                                        
##  [5] "X.Largest.Property.Use.Type.Multifamily.Housing"                  
##  [6] "X.Site.EUI..kBtu.ft².."                                           
##  [7] "X.Largest.Property.Use.Type.Supermarket.Grocery.Store"            
##  [8] "X.Total.GHG.Emissions..Metric.Tons.CO2e.."                        
##  [9] "X.Largest.Property.Use.Type.Hospital..General.Medical...Surgical."
## [10] "X.Largest.Property.Use.Type.Distribution.Center"                  
## [11] "X.Weather.Normalized.Site.Electricity.Intensity..kWh.ft².."       
## [12] "second_property_use_rate"                                         
## [13] "X.Weather.Normalized.Site.Natural.Gas.Intensity..therms.ft².."    
## [14] "largest_property_use_rate"                                        
## [15] "X.Water.Intensity..All.Water.Sources...gal.ft².."                 
## [16] "BoroughManhattan"                                                 
## [17] "X.Largest.Property.Use.Type.Hotel"                                
## [18] "X.Largest.Property.Use.Type.Retail.Store"                         
## [19] "BoroughQueens"                                                    
## [20] "X.Largest.Property.Use.Type.Residence.Hall.Dormitory"             
## [21] "BoroughBronx"                                                     
## [22] "X.Largest.Property.Use.Type.K.12.School"                          
## [23] "X.Largest.Property.Use.Type.Senior.Care.Community"                
## [24] "third_property_use_rate"                                          
## [25] "BoroughBrooklyn"                                                  
## [26] "X.Largest.Property.Use.Type.Other"                                
## [27] "X.Largest.Property.Use.Type.Medical.Office"                       
## [28] "X.Number.of.Buildings...Self.reported."                           
## [29] "Occupancy"                                                        
## [30] "BoroughStaten.Island"                                             
## [31] "X.Largest.Property.Use.Type.Parking"

–>Its good to see that 3 of our feature engineered variables is among the top selected features– “Direct_GHG_Emissions_rate”,“second_property_use_rate” & “largest_property_use_rate”

12 Modelling with Stacking–Cross validated

This section is just to understand which model can be a better choice for solving this problem. A stacked model genrally is better than independent set of models because stacking takes the advantage of modelling variations from different models.

library(caret)
library(devtools)
library(caretEnsemble)
library(mlbench)
library(rsample)
library(dplyr)

#Split train/test
set.seed(123)
library(caTools)
split = sample.split(train_data$X.ENERGY.STAR.Score., SplitRatio = 0.8)
tr = subset(train_data, split == TRUE)
test = subset(train_data, split == FALSE)


folds=5
repeats=1
myControl <- trainControl(method='repeatedcv', number=folds, repeats=repeats, returnResamp='none', 
                          returnData=FALSE, savePredictions=TRUE, 
                          verboseIter=TRUE, allowParallel=TRUE,
                          index=createMultiFolds(train_data$X.ENERGY.STAR.Score., k=folds, times=repeats))
PP <- c('center', 'scale')

#cl <- makePSOCKcluster(10)
#registerDoParallel(cl)
#Train some models

##GBM
model1 <- readRDS("model1.RDS")


#KNN
model3 <- readRDS("model3.RDS")


model4 <- readRDS("model4.RDS")



model5 <- readRDS("model5.RDS")



model6 <- readRDS("model6.RDS")



model7 <- readRDS("model7.RDS")





#Make a list of all the models
all.models <- list(model1, model3, model4, model5,model6,model7)
names(all.models) <- sapply(all.models, function(x) x$method)
sort(sapply(all.models, function(x) max(x$results$Rsquared)))
##   glmboost     glmnet        knn     cubist        gbm 
## 0.04436680 0.04440788 0.32136178 0.82354712 0.82696916
#Make a linear/gbm/rf ensemble

results <- readRDS("results.RDS")
gbm <- readRDS("gbm.RDS")
rf<- readRDS("gbm.RDS")


max(gbm$error$Rsquared)
## [1] 0.8295693
#Predict for test set:
#preds <- data.frame(sapply(all.models, predict, newdata=test[,-c(1,23)]))
#preds$ENS_gbm <- predict(gbm, newdata=test[,-c(1,23)])
#preds$ENS_rf <- predict(rf, newdata=test[,-c(1,23)])
#preds$ENS_glmnet <- predict(ak,newdata=test[,-c(1,23)])

Rsquare value for our stacked gbm model is slightly higher than individual model which also means that the stacked model is working a little better but we also need to consider other type of models to enhance this

13 Future Work

I have neglected a lot of things which I can do but just for the sake of saving time I have not mentioned them above so I would like to put them in future work:

  1. Model Explainability for black box models

  2. Trying out other ways of missing value imputation and its effects on accuracy metrics

  3. Feature distribution and outlier detection with respect to univariate and multivariatez data(DBSCAN)

  4. Feature engineering with respect to other columns

  5. Stacking with more models